
feat(ci): add HPC-grade CI/CD workflows, GPU hardware validation, and benchmark regression gate#6

Merged
Shards-inc merged 1 commit into main from codex/design-ci/cd-pipeline-for-gpu-kernels on Mar 8, 2026

Conversation

@Shards-inc (Member)

Motivation

  • Provide a production-grade CI/CD design for kernel-style compute projects that enforces deterministic builds, hardware-specific validation, and automated performance gates.
  • Move expensive hardware checks (GPU numerical correctness and benchmark regression) into separate, scheduled workflows so PR latency stays acceptable.

Description

  • Added a new architecture reference docs/pipelines/HPC_CICD_ARCHITECTURE.md describing the layered pipeline, deterministic build strategy, validation gates, rollback policy, and scaling guidance.
  • Introduced multiple GitHub Actions workflows: .github/workflows/hpc-matrix.yml (expanded CPU/Torch & CUDA compatibility matrix), .github/workflows/gpu-hardware.yml (self-hosted GPU hardware validation + benchmark artifact upload), .github/workflows/benchmark.yml (benchmark regression enforcement), and .github/workflows/docs.yml (strict docs build).
  • Added a benchmark regression utility scripts/ci/check_benchmark_regression.py and a placeholder baseline .ci/benchmarks/baseline.json, plus mkdocs.yml and links from README.md / docs/README.md to surface the new doc.
  • Updated CHANGELOG.md to record the CI/CD additions and committed the new files to the PR branch.
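The regression utility listed above is invoked as `python scripts/ci/check_benchmark_regression.py .ci/benchmarks/baseline.json 0.05` and passes when the baseline is empty. A minimal sketch of such a gate is below; the real script's file formats, field names, and `benchmark.json` path are assumptions, not the committed implementation.

```python
"""Hedged sketch of a benchmark regression gate in the spirit of
scripts/ci/check_benchmark_regression.py. Data shapes are assumed:
baseline and current results are {benchmark_name: seconds} mappings."""
import json
import sys


def check_regression(baseline: dict, current: dict, tolerance: float) -> list:
    """Return failure messages for benchmarks slower than baseline * (1 + tolerance)."""
    failures = []
    for name, base_time in baseline.items():
        cur_time = current.get(name)
        if cur_time is None:
            continue  # benchmark removed or renamed; not treated as a regression
        if cur_time > base_time * (1.0 + tolerance):
            failures.append(f"{name}: {cur_time:.6f}s vs baseline {base_time:.6f}s")
    return failures


def main() -> int:
    baseline_path, tolerance = sys.argv[1], float(sys.argv[2])
    with open(baseline_path) as f:
        baseline = json.load(f)
    if not baseline:
        print("Baseline empty; regression gate skipped.")
        return 0  # matches the observed behavior with the placeholder baseline
    with open("benchmark.json") as f:  # assumed output path of the benchmark run
        current = json.load(f)
    failures = check_regression(baseline, current, tolerance)
    for msg in failures:
        print("REGRESSION:", msg)
    return 1 if failures else 0


if __name__ == "__main__" and len(sys.argv) >= 3:  # CLI use only
    sys.exit(main())
```

Exiting nonzero on any regression lets the workflow fail the job directly, while the empty-baseline early return keeps the gate inert until a real baseline is committed.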

Testing

  • Ran bash scripts/smoke.sh (project smoke tests) which completed successfully. ✅
  • Byte-compiled the benchmark checker with python -m compileall scripts/ci/check_benchmark_regression.py and ran python scripts/ci/check_benchmark_regression.py .ci/benchmarks/baseline.json 0.05; both succeeded (the baseline is empty, so the gate is skipped). ✅
  • Attempted docs build validation; mkdocs installation failed in this environment due to package index/proxy restrictions, so mkdocs build --strict could not be validated here. ⚠️
  • Ran ruff check . which failed repository-wide due to pre-existing lint issues unrelated to the changes in this PR; these lint failures are external to the new CI artifacts and should be addressed separately. ❌
  • Note: GPU hardware workflow is designed to run on self-hosted GPU runners and was not executed in this environment (requires labeled runners: a100, h100, rtx4090).

Codex Task

@Shards-inc Shards-inc merged commit 26dac06 into main Mar 8, 2026
7 of 25 checks passed

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27c24d6f85

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

```yaml
    pip install pytest pytest-benchmark

- name: Run benchmark suite
  run: pytest tests -m "benchmark" --benchmark-json benchmark.json -q
```


P1: Run benchmark suite without an empty marker filter

pytest -m only executes tests matching the mark expression (pytest --help), and this repository currently has no benchmark-marked tests under tests/, so this command deselects everything and exits with code 5 (No tests collected). In this workflow context (PR and nightly), that causes the benchmark job to fail before the regression checker can run.


```yaml
run: pytest -m "not gpu" --cov=kernels --cov=implementations --cov-fail-under=85

- name: Integration tests
  run: pytest tests -m "integration and not gpu" -q
```


P1: Remove HPC marker filters that select zero tests

The integration step uses -m "integration and not gpu", but there are no matching marks in the current tests/ tree, so pytest deselects all tests and returns exit code 5; the same issue also affects the fallback step in this workflow. As written, both matrix jobs can fail even when the codebase is otherwise healthy.

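Whichever way the marker filters are resolved, the marker names the workflows reference should also be registered so pytest (especially under `--strict-markers`) does not warn or error on them. A minimal pytest.ini sketch registering those names is below; the file itself and the descriptions are assumptions (pytest-benchmark registers its own `benchmark` marker when installed):

```ini
# Hedged sketch: register the markers the CI workflows filter on.
[pytest]
markers =
    benchmark: performance benchmarks run by benchmark.yml
    integration: cross-component tests run in the HPC matrix
    gpu: tests requiring a CUDA-capable device
    numerical: numerical-correctness checks for GPU kernels
```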


```yaml
- name: GPU integration and numerical tests
  run: |
    pytest tests -m "gpu or numerical" -q
```


P1: Fix GPU test selection to avoid no-tests-collected failures

This GPU validation step filters on gpu/numerical markers, but there are no such marks in the present test suite, so pytest exits 5 after deselecting everything. On scheduled/manual self-hosted runs, the workflow will fail at test selection rather than validating hardware behavior.

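As with the benchmark marker, this suggestion can be addressed by adding gpu/numerical-marked tests so the self-hosted workflow collects something. A hedged sketch, assuming a torch-based project; the matmul check is a stand-in, not a test from this repository:

```python
# Hedged sketch: a test carrying the gpu and numerical markers so that
# `pytest tests -m "gpu or numerical"` collects at least one test on the
# self-hosted GPU runners.
import pytest


@pytest.mark.gpu
@pytest.mark.numerical
def test_gpu_matmul_matches_cpu():
    torch = pytest.importorskip("torch")  # skip cleanly if torch is absent
    if not torch.cuda.is_available():
        pytest.skip("requires a CUDA-capable device")
    a = torch.randn(64, 64)
    b = torch.randn(64, 64)
    expected = a @ b  # CPU reference result
    actual = (a.cuda() @ b.cuda()).cpu()
    torch.testing.assert_close(actual, expected, rtol=1e-4, atol=1e-4)
```

Gating on `torch.cuda.is_available()` inside the test (rather than at import time) keeps collection working on CPU-only runners while still validating numerics on the a100/h100/rtx4090 labels.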
